In this I project I explore health insurance data, attempt to identify which factors are most important in predicting medical expenses, and perform linear regression analysis.
Below are the factors provided in the Medical Cost data set:


Let’s get started!

Loading in packages, removing objects from the environment and setting the working directory.

if (!require("pacman")) install.packages("pacman"); library(pacman)
pacman::p_load(tidyverse, caTools, ggthemes, janitor, corrplot, GGally, psych)

rm(list=ls())
setwd("/Users/Jesse/R/MedInsuranceProject/")

Reading in the data.

df <- read_csv("/Users/Jesse/R/MedInsuranceProject/insurance.csv")

Let’s take a took at the data:

head(df)
## # A tibble: 6 x 7
##     age sex      bmi children smoker region    charges
##   <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.
## 2    18 male    33.8        1 no     southeast   1726.
## 3    28 male    33          3 no     southeast   4449.
## 4    33 male    22.7        0 no     northwest  21984.
## 5    32 male    28.9        0 no     northwest   3867.
## 6    31 female  25.7        0 no     southeast   3757.
str(df)
## spec_tbl_df [1,338 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age     : num [1:1338] 19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr [1:1338] "female" "male" "male" "male" ...
##  $ bmi     : num [1:1338] 27.9 33.8 33 22.7 28.9 ...
##  $ children: num [1:1338] 0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr [1:1338] "yes" "no" "no" "no" ...
##  $ region  : chr [1:1338] "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num [1:1338] 16885 1726 4449 21984 3867 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_character(),
##   ..   bmi = col_double(),
##   ..   children = col_double(),
##   ..   smoker = col_character(),
##   ..   region = col_character(),
##   ..   charges = col_double()
##   .. )

Let’s change the character columns to factors:

df$sex <- as.factor(df$sex)
df$smoker <- as.factor(df$smoker)
df$region <- as.factor(df$region)

Finally, let’s check for any missing values:

any(is.na(df))
## [1] FALSE

No missing data! Let’s move on to some exploratory data analysis.


EDA

summary(df)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

Let’s take a look at a histogram of the insurance charges:

df %>% ggplot(aes(charges)) + geom_histogram(aes(y=stat(density)), fill = "light blue", color = "white", bins = 30) + 
    ggtitle("Distribution of Charges") + theme_bw() + theme(plot.title = element_text(hjust = .5)) +
    geom_density(col="dark blue")


This distribution is right-skewed. Let’s make the distribution closer to normal:

df %>% ggplot(aes(log10(charges))) + geom_histogram(aes(y=stat(density)), fill = "light blue", color = "white", bins = 30) + 
    ggtitle("Distribution of Charges") + theme_bw() + theme(plot.title = element_text(hjust = .5)) + 
    geom_density(col = "dark blue")


Now let’s look at charges by region:

describeBy(df$charges,df$region)
## 
##  Descriptive statistics by group 
## group: northeast
##    vars   n     mean      sd   median  trimmed     mad    min      max    range
## X1    1 324 13406.38 11255.8 10057.65 11444.31 7806.78 1694.8 58571.07 56876.28
##    skew kurtosis     se
## X1 1.48     1.68 625.32
## ------------------------------------------------------------ 
## group: northwest
##    vars   n     mean       sd median  trimmed     mad     min     max    range
## X1    1 325 12417.58 11072.28 8965.8 10414.54 7001.14 1621.34 60021.4 58400.06
##    skew kurtosis     se
## X1 1.67     2.53 614.18
## ------------------------------------------------------------ 
## group: southeast
##    vars   n     mean      sd  median  trimmed     mad     min      max    range
## X1    1 364 14735.41 13971.1 9294.13 12563.65 8749.51 1121.87 63770.43 62648.55
##    skew kurtosis     se
## X1 1.24     0.48 732.28
## ------------------------------------------------------------ 
## group: southwest
##    vars   n     mean       sd  median  trimmed     mad     min      max
## X1    1 325 12346.94 11557.18 8798.59 10120.52 6329.39 1241.57 52590.83
##       range skew kurtosis     se
## X1 51349.26 1.67     2.03 641.08
df %>% ggplot(aes(region, charges)) + geom_boxplot(fill=c(3,4,6,7)) + 
    theme_bw() + ggtitle("Medical Insurance Charges by Region")


It looks like charges don’t differ much across regions, but the highest medical expenses are in the Southeast and the lowest are in the Southwest.

Let’s check out charges by smoker status:

describeBy(df$charges,df$smoker)
## 
##  Descriptive statistics by group 
## group: no
##    vars    n    mean      sd  median trimmed     mad     min      max    range
## X1    1 1064 8434.27 5993.78 7345.41 7599.76 5477.15 1121.87 36910.61 35788.73
##    skew kurtosis     se
## X1 1.53     3.12 183.75
## ------------------------------------------------------------ 
## group: yes
##    vars   n     mean       sd   median  trimmed      mad      min      max
## X1    1 274 32050.23 11541.55 34456.35 31782.89 15167.19 12829.46 63770.43
##       range skew kurtosis     se
## X1 50940.97 0.13    -1.05 697.25
df %>% ggplot(aes(smoker, charges)) + geom_boxplot(fill=c(3,2)) + 
    theme_bw() + ggtitle("Medical Insurance Charges by Smoker Status")


As you would expect, smokers spend much more than non-smokers. Smokers spend about 4x more compared to non-smokers when it comes to medical insurance expenses.

Next, let’s look at sex:

describeBy(df$charges,df$sex)
## 
##  Descriptive statistics by group 
## group: female
##    vars   n     mean      sd  median  trimmed     mad     min      max    range
## X1    1 662 12569.58 11128.7 9412.96 10455.16 7129.08 1607.51 63770.43 62162.92
##    skew kurtosis     se
## X1 1.72     2.71 432.53
## ------------------------------------------------------------ 
## group: male
##    vars   n     mean       sd  median trimmed     mad     min      max range
## X1    1 676 13956.75 12971.03 9369.62 11825.4 8121.53 1121.87 62592.87 61471
##    skew kurtosis     se
## X1 1.33     0.79 498.89
df %>% ggplot(aes(sex, charges)) + geom_boxplot(fill=c(5,7)) + 
    theme_bw() + ggtitle("Medical Insurance Charges by Sex")


It appears sex doesn’t seem to affect medical expenses very much.

Moving on to number of children:

describeBy(df$charges,df$children)
## 
##  Descriptive statistics by group 
## group: 0
##    vars   n     mean       sd  median  trimmed      mad     min      max
## X1    1 574 12365.98 12023.29 9856.95 10155.21 10067.29 1121.87 63770.43
##       range skew kurtosis     se
## X1 62648.55 1.53     1.95 501.84
## ------------------------------------------------------------ 
## group: 1
##    vars   n     mean       sd  median trimmed     mad     min      max    range
## X1    1 324 12731.17 11823.63 8483.87 10364.8 5859.46 1711.03 58571.07 56860.05
##    skew kurtosis     se
## X1 1.66     1.97 656.87
## ------------------------------------------------------------ 
## group: 2
##    vars   n     mean       sd  median  trimmed     mad  min      max    range
## X1    1 240 15073.56 12891.37 9264.98 12895.82 6587.43 2304 49577.66 47273.66
##    skew kurtosis     se
## X1 1.28     0.35 832.13
## ------------------------------------------------------------ 
## group: 3
##    vars   n     mean       sd   median  trimmed     mad     min     max
## X1    1 157 15355.32 12330.87 10600.55 13220.71 6918.06 3443.06 60021.4
##       range skew kurtosis     se
## X1 56578.33 1.45     1.21 984.11
## ------------------------------------------------------------ 
## group: 4
##    vars  n     mean      sd   median  trimmed    mad     min      max    range
## X1    1 25 13850.66 9139.22 11033.66 12401.81 7109.3 4504.66 40182.25 35677.58
##    skew kurtosis      se
## X1 1.45     1.59 1827.84
## ------------------------------------------------------------ 
## group: 5
##    vars  n    mean      sd  median trimmed     mad    min      max    range
## X1    1 18 8786.04 3808.44 8589.57 8402.35 3631.71 4687.8 19023.26 14335.46
##    skew kurtosis     se
## X1 1.04     0.54 897.66
df %>% ggplot(aes(as.factor(children), charges)) + geom_boxplot(fill=c(2:7)) + 
    theme_bw() + ggtitle("Medical Insurance Charges by Number of Children")


Interestingly, on average, people with 5 children spend less on medical expense than all other groups.

Let’s check out obesity now. First we need to create a new column for obese, which is greater than or equal to a BMI of 30.

df <- df %>%
    mutate(obese = as.factor(if_else(bmi >= 30, "yes", "no")))

describeBy(df$charges,df$obese)
## 
##  Descriptive statistics by group 
## group: no
##    vars   n     mean      sd  median trimmed     mad     min      max    range
## X1    1 631 10713.67 7843.54 8604.48 9772.74 7024.01 1121.87 38245.59 37123.72
##    skew kurtosis     se
## X1 0.97     0.23 312.25
## ------------------------------------------------------------ 
## group: yes
##    vars   n     mean       sd  median  trimmed     mad     min      max
## X1    1 707 15552.34 14552.32 9964.06 13451.03 7883.43 1131.51 63770.43
##       range skew kurtosis    se
## X1 62638.92 1.18     0.08 547.3
df %>% ggplot(aes(obese, charges)) + geom_boxplot(fill=c(6,4)) + 
    theme_bw() + ggtitle("Medical Charges by Obesity status (BMI 30+)")


As we can see, the average medical expenses for an obese individual is about $5000 more than non-obese people, while the median expenses between the group is similar.

Lastly, let’s look at age, but also with smoker status.

df %>% ggplot(aes(age, charges)) + geom_point(aes(color = smoker)) +
    theme_bw() + ggtitle("Medical Charges by Age and Smoker Status") + scale_color_manual(values=c("red", "blue"))


As expected, as people get older they tend to have higher medical expenses, but adding smoker status you can see the relationship between smoking and higher medical expenses.

Alright, let’s move on to looking at correlation between variables.


Correlation

num_cols <- sapply(df,is.numeric) # using only numeric features
ggpairs(df[,num_cols]) # matrix of plots

pairs.panels(df[,num_cols]) # matrix of plots using pairs.panels

We can see that age and charges have the highest correlation among the numeric variables. It’s also interesting to note that none of our numeric values are highly correlated with each other, so multicollinearity won’t be a problem.

Now let’s look at the correlation matrix for numeric features a different way.

cor_mx <- cor(df[,num_cols]) # correlation matrix
cor_mx
##                age       bmi   children    charges
## age      1.0000000 0.1092719 0.04246900 0.29900819
## bmi      0.1092719 1.0000000 0.01275890 0.19834097
## children 0.0424690 0.0127589 1.00000000 0.06799823
## charges  0.2990082 0.1983410 0.06799823 1.00000000
corrplot(cor_mx, method = "color", title = "Correlation Matrix for Numeric Features", 
         mar=c(0,0,1,0))


Let’s move on and look at correlation between all features. But first we need to create dummy variables for sex and smoker.

df_dummy <- df %>%
    mutate(sex_dummy = if_else(sex == "male", 1, 0)) %>%
    mutate(smoker_dummy = if_else(smoker == "yes", 1, 0))

df_dummy <- df_dummy %>% 
    select(1,3,4,7,9,10)

# correlation all features
ggpairs(df_dummy)

pairs.panels(df_dummy)

dummy_cor_mx <- cor(df_dummy)
corrplot(dummy_cor_mx, method = "color", title = "Correlation Matrix for All Features", 
         mar=c(0,0,1,0))


We can see that smoking and medical expense are highly correlated.

Let’s move on to building a model using all of the original features, except using obese instead of BMI because BMI is essentially the same thing as obese.


Building a model

set.seed(214)
sample <- sample.split(df$charges, SplitRatio = 0.70)

# Training Data
train = subset(df, sample == TRUE)

# Testing Data
test = subset(df, sample == FALSE)

med_model <- lm(charges ~ .-bmi, data = train)
summary(med_model)
## 
## Call:
## lm(formula = charges ~ . - bmi, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12378.6  -3632.7   -411.7   1184.5  28190.5 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4148.16     760.41  -5.455 6.28e-08 ***
## age               269.45      14.26  18.890  < 2e-16 ***
## sexmale            57.44     402.38   0.143  0.88652    
## children          502.32     164.85   3.047  0.00238 ** 
## smokeryes       22867.71     504.00  45.373  < 2e-16 ***
## regionnorthwest  -587.99     580.23  -1.013  0.31115    
## regionsoutheast  -498.98     571.72  -0.873  0.38302    
## regionsouthwest -1270.60     580.84  -2.188  0.02895 *  
## obeseyes         3999.25     409.43   9.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6109 on 927 degrees of freedom
## Multiple R-squared:  0.7304, Adjusted R-squared:  0.7281 
## F-statistic:   314 on 8 and 927 DF,  p-value: < 2.2e-16
plot(med_model)


Just using the original variables we get a fairly good R-squared of 0.7304, which implies that about 73% of the variation in charges can be explained by the independent variables. Also, all variables except for region and sex, are statistically significant predictors of medical expense (p < .05).


Let’s attempt to improve the model:

First, let’s remove region and sex, as they were not statistically significant.

med_model2 <- lm(charges ~ . -bmi -sex -region, data = train)
summary(med_model2)
## 
## Call:
## lm(formula = charges ~ . - bmi - sex - region, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12644.8  -3631.6   -284.8   1183.8  28264.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4711.97     646.07  -7.293 6.46e-13 ***
## age           269.60      14.26  18.900  < 2e-16 ***
## children      490.02     164.61   2.977  0.00299 ** 
## smokeryes   22932.73     500.57  45.813  < 2e-16 ***
## obeseyes     3975.47     402.32   9.881  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6112 on 931 degrees of freedom
## Multiple R-squared:  0.729,  Adjusted R-squared:  0.7278 
## F-statistic: 626.1 on 4 and 931 DF,  p-value: < 2.2e-16
plot(med_model2)


This model is about the same, but slightly worse, with an R-squared of 0.729, still implying about 73% of the variation in charges can be explained by the independent variables. Let’s try a model with an interaction between obesity and smoker status.


med_model3 <- lm(charges ~ obese*smoker + age + children, data = train)
summary(med_model3)
## 
## Call:
## lm(formula = charges ~ obese * smoker + age + children, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19072.0  -2025.0  -1395.7   -604.1  24448.9 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2834.59     503.27  -5.632 2.36e-08 ***
## obeseyes              54.04     346.65   0.156    0.876    
## smokeryes          13077.17     548.52  23.841  < 2e-16 ***
## age                  272.33      10.99  24.779  < 2e-16 ***
## children             609.37     126.91   4.802 1.83e-06 ***
## obeseyes:smokeryes 19481.61     771.01  25.268  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4709 on 930 degrees of freedom
## Multiple R-squared:  0.8393, Adjusted R-squared:  0.8385 
## F-statistic: 971.6 on 5 and 930 DF,  p-value: < 2.2e-16
plot(med_model3)


This third model is much better than the first two, with an R-squared of .8393, implying about 84% of the variation in charges can be explained by the independent variables. Our Residual vs Fitted plot appears a little better, although the Normal Q-Q plot still indicates some problems with our fit.

We can interpret the model as follows:
  • For an obese non-smoker, we can expect $54.04 higher charges on average
  • For a non-obese smoker, we can expect $13,077.17 higher charges on average
  • For each year increase in age, we can expect a $272.33 increase in average charges
  • For a person who is an obese smoker, we can expect $54.04 + $13,077.17 + $19,481.61 = $32,612.82 higher charges on average


Finally, let’s make predictions

test$predicted <- predict(med_model3, newdata = test)
test %>%
  ggplot() + geom_point(aes(predicted, charges, color = smoker, shape = obese)) +
  geom_abline(color = "red") + ggtitle("Prediction vs Real Values")


It looks like our model is a pretty solid fit for our test data and our results are pretty accurate.

# calculating residuals 
test$residuals <- test$charges - test$predicted

# plot residuals 
test %>%
  ggplot() + geom_pointrange(aes(predicted, residuals, ymin = 0, ymax = residuals, color = smoker, shape = obese)) +
  geom_hline(yintercept = 0) + ggtitle("Residuals vs Fitted Values")


Overall, the residuals look pretty good, although there are several high residuals that could be caused by factors not included in the original data set.